Project Description

Floods are among the most destructive natural disasters. In the United States, natural disaster recovery is managed by (FEMA) Federal Emergency Managment Agency. One interesting aspect of floods is that they can occur almost anywhere. Assemble a dataset to investigate floods in 2020-2021. Use the data resources below to assemble your data. Clean and organize the data. Write an EDA report.

Questions to be answered: How dangerous are floods? How expensive? Is there any pattern to the kinds of communities that suffer losses from floods?

Datasets

There are three files in total. Datasets from NOAA, datasets from FEMA and census data.

NOAA: https://www.ncei.noaa.gov/pub/data/swdi/stormevents/csvfiles/ FEMA1: https://www.fema.gov/openfema-data-page/fema-web-disaster-summaries-v1 FEMA2: https://www.fema.gov/openfema-data-page/disaster-declarations-summaries-v2 Census Data: https://www.census.gov (somewhere inside)

Import NOAA Data

This report focuses on data from 2020 to 2021.

# Import Datasets from NOAA
detail2020 <- read.csv("/Users/thomas/Downloads/ma615/midterm/NOAA/StormEvents_details-ftp_v1.0_d2020_c20230927.csv")
detail2021 <- read.csv("/Users/thomas/Downloads/ma615/midterm/NOAA/StormEvents_details-ftp_v1.0_d2021_c20231017.csv")
location2020 <- read.csv("/Users/thomas/Downloads/ma615/midterm/NOAA/StormEvents_locations-ftp_v1.0_d2020_c20231017.csv")
location2021 <- read.csv("/Users/thomas/Downloads/ma615/midterm/NOAA/StormEvents_locations-ftp_v1.0_d2021_c20231017.csv")
fata2020 <- read.csv("/Users/thomas/Downloads/ma615/midterm/NOAA/StormEvents_fatalities-ftp_v1.0_d2020_c20230927.csv")
fata2021 <- read.csv("/Users/thomas/Downloads/ma615/midterm/NOAA/StormEvents_fatalities-ftp_v1.0_d2021_c20231017.csv")

Merge detail2020 ,fata 2020 and location2020; detail2021, fata 2021 and location2021

The detail, location and fatality datasets can be merged into one single dataset due to some common columns. This project only focuses on “FLOOD” as a disaster type. There should be four types of flodds in total: “Flood”, “Lakeshore Flood”, “Flash Flood”, “Coastal Flood”.

# Validate keys
if(any(duplicated(detail2020$EVENT_ID)) | any(duplicated(detail2021$EVENT_ID))) {
  stop("Duplicate EVENT_IDs found in detail datasets.")
}

# Pre-process datasets (if necessary, based on your analysis needs)

# Merge in steps
combinedDetails <- rbind(detail2020, detail2021)
combinedLocations <- rbind(location2020, location2021)
combinedFatalities <- rbind(fata2020, fata2021)

combined <- merge(combinedDetails, combinedLocations, by = c("EVENT_ID", "EPISODE_ID"), all = TRUE)
combined <- merge(combined, combinedFatalities, by = "EVENT_ID", all.x = TRUE)

# Post-process merged dataset
combined <- combined[!duplicated(combined$EVENT_ID), ]

# Consistency check (based on your analysis needs)

# Define the types of floods to filter
floodTypes <- c("Flood", "Coastal Flood", "Flash Flood", "Lakeshore Flood")

# Filter for Flood events
combined <- combined[combined$EVENT_TYPE %in% floodTypes, ]

# Get unique incidents of flood and print them
unique_incidents <- unique(combined$EVENT_TYPE)
print(unique_incidents)
## [1] "Flood"           "Lakeshore Flood" "Flash Flood"     "Coastal Flood"

Clean NOAA Data

The “combined” dataset is the raw dataset that I have right now. This step is to remove some columns that are not helpful for EDA, and I am going to choose some columns that I am interested.

#colnames(combined)
combined <- combined[, -which(names(combined) %in% c("WFO", "CZ_TIMEZONE", "SOURCE","TOR_WIDTH", "TOR_LENGTH", "TOR_OTHER_CZ_FIPS", "TOR_OTHER_CZ_NAME", "TOR_OTHER_CZ_STATE", "TOR_F_SCALE","CZ_TYPE","CZ_FIPS","CZ_NAME","EVENT_NARRATIVE","EPISODE_NARRATIVE","END_AZIMUTH","END_RANGE","BEGIN_RANGE","BEGIN_AZIMUTH","LAT2","LON2","RANGE","AZIMUTH","LOCATION_INDEX","STATE_FLIPS"))]
colnames(combined)
##  [1] "EVENT_ID"          "EPISODE_ID"        "BEGIN_YEARMONTH"  
##  [4] "BEGIN_DAY"         "BEGIN_TIME"        "END_YEARMONTH"    
##  [7] "END_DAY"           "END_TIME"          "STATE"            
## [10] "STATE_FIPS"        "YEAR"              "MONTH_NAME"       
## [13] "EVENT_TYPE"        "BEGIN_DATE_TIME"   "END_DATE_TIME"    
## [16] "INJURIES_DIRECT"   "INJURIES_INDIRECT" "DEATHS_DIRECT"    
## [19] "DEATHS_INDIRECT"   "DAMAGE_PROPERTY"   "DAMAGE_CROPS"     
## [22] "MAGNITUDE"         "MAGNITUDE_TYPE"    "FLOOD_CAUSE"      
## [25] "CATEGORY"          "TOR_OTHER_WFO"     "BEGIN_LOCATION"   
## [28] "END_LOCATION"      "BEGIN_LAT"         "BEGIN_LON"        
## [31] "END_LAT"           "END_LON"           "DATA_SOURCE"      
## [34] "YEARMONTH"         "LOCATION"          "LATITUDE"         
## [37] "LONGITUDE"         "FAT_YEARMONTH"     "FAT_DAY"          
## [40] "FAT_TIME"          "FATALITY_ID"       "FATALITY_TYPE"    
## [43] "FATALITY_DATE"     "FATALITY_AGE"      "FATALITY_SEX"     
## [46] "FATALITY_LOCATION" "EVENT_YEARMONTH"
Check how dengerous flood is

I am going to use some columns for later analysis. To be noticed, I have converted the incident begin date here (where incident corresponds to any flood).

dangerousflood <- combined[, c("EVENT_ID", "EPISODE_ID", "STATE", "YEAR", "MONTH_NAME", "EVENT_TYPE","INJURIES_DIRECT", "INJURIES_INDIRECT", "DEATHS_DIRECT", "DEATHS_INDIRECT",  "DAMAGE_PROPERTY", "DAMAGE_CROPS", "FLOOD_CAUSE", "CATEGORY","BEGIN_LOCATION", "END_LOCATION", "LATITUDE", "LONGITUDE","FATALITY_AGE", "FATALITY_SEX","FATALITY_LOCATION","BEGIN_DATE_TIME","MAGNITUDE_TYPE","MAGNITUDE")]

# Convert BEGIN_DATE_TIME to Date format
dangerousflood$BEGIN_DATE <- as.Date(dangerousflood$BEGIN_DATE_TIME, format = "%d-%b-%y %H:%M:%S")

# Check the first few dates to ensure they were converted correctly
head(dangerousflood$BEGIN_DATE)
## [1] "2020-01-01" "2020-01-01" "2020-01-01" "2020-01-11" "2020-01-11"
## [6] "2020-01-11"

Import census data

Well, I dont’ find this dataset particular useful for now.

#census 5225
meta2020_1_5225 <- read.csv("/Users/thomas/Downloads/ma615/midterm/Census Download_2023-10-23T135225/ACSST5Y2020.S1701-Column-Metadata.csv")
meta2020_2_5225 <- read.csv("/Users/thomas/Downloads/ma615/midterm/Census Download_2023-10-23T135225/ACSST5Y2020.S1701-Data.csv", skip = 1)
meta2021_1_5225 <- read.csv("/Users/thomas/Downloads/ma615/midterm/Census Download_2023-10-23T135225/ACSST5Y2021.S1701-Column-Metadata.csv")
meta2021_2_5225 <- read.csv("/Users/thomas/Downloads/ma615/midterm/Census Download_2023-10-23T135225/ACSST5Y2021.S1701-Data.csv")

#census 0133
meta2020_1_0133 <- read.csv("/Users/thomas/Downloads/ma615/midterm/Census Download_2023-10-23T140133/ACSDP5Y2020.DP05-Column-Metadata.csv")
meta2020_2_0133 <- read.csv("/Users/thomas/Downloads/ma615/midterm/Census Download_2023-10-23T140133/ACSDP5Y2020.DP05-Data.csv")
meta2021_1_0133 <- read.csv("/Users/thomas/Downloads/ma615/midterm/Census Download_2023-10-23T140133/ACSDP5Y2021.DP05-Column-Metadata.csv")
meta2021_2_0133 <- read.csv("/Users/thomas/Downloads/ma615/midterm/Census Download_2023-10-23T140133/ACSDP5Y2021.DP05-Data.csv")

#census 0147
meta2020_1_0147 <- read.csv("/Users/thomas/Downloads/ma615/midterm/Census Download_2023-10-23T140147/ACSDT5Y2020.B25001-Column-Metadata.csv")
meta2020_2_0147 <- read.csv("/Users/thomas/Downloads/ma615/midterm/Census Download_2023-10-23T140147/ACSDT5Y2020.B25001-Data.csv")
meta2021_1_0147 <- read.csv("/Users/thomas/Downloads/ma615/midterm/Census Download_2023-10-23T140147/ACSDT5Y2021.B25001-Column-Metadata.csv")
meta2021_2_0147 <- read.csv("/Users/thomas/Downloads/ma615/midterm/Census Download_2023-10-23T140147/ACSDT5Y2021.B25001-Data.csv")

Import FEMA datasets

I found API endpoints from FEMA website and downloaded the data directly using API for both FEMA datasets.

#Import Datasets from FEMA v1
data_url <- "https://www.fema.gov/api/open/v1/FemaWebDisasterSummaries.json"

# Making the GET request to the data URL
response <- GET(url = data_url)

# Check the response status code
print(status_code(response))
## [1] 200
# Parse the content of the response into a dataframe
data <- fromJSON(rawToChar(response$content))

# Convert the data to a dataframe
floodv1 <- as.data.frame(data$FemaWebDisasterSummaries)

Time for another FEMA datasets, however the sample size is small. I will pause it for now.

# Import Datasets from FEMA v2
# Correctly encode the URL
data_url <- "https://www.fema.gov/api/open/v2/DisasterDeclarationsSummaries.json?$filter=incidentType%20eq%20%27Flood%27%20and%20(incidentBeginDate%20ge%20%272020-01-01T00:00:00.000Z%27%20and%20incidentBeginDate%20le%20%272021-12-31T23:59:59.000Z%27)"

# Making the GET request to the data URL
response <- GET(url = data_url)

# Check the response status code
print(status_code(response))
## [1] 200
# Parse the content of the response into a dataframe
data <- fromJSON(rawToChar(response$content))

# Convert the data to a dataframe
floodv2 <- as.data.frame(data$DisasterDeclarationsSummaries)

# Check the unique values in the incidentType column
unique_incidents <- unique(floodv2$incidentType)

# Print the unique incident types
print(unique_incidents)
## [1] "Flood"

Merge FEMA datasets

I tried to merge them into one dataset.

combinedFEMA <- floodv1 %>%
  inner_join(floodv2, by = "disasterNumber")

Deep cleaning NOAA dataset and EDA

I decided to use NOAA for my report as there are many insightful columns for analysis.

Glimpse NOAA - dangerousflood

# glimpse
glimpse(dangerousflood)
## Rows: 13,661
## Columns: 25
## $ EVENT_ID          <int> 863537, 863585, 863586, 864250, 864251, 864461, 8644…
## $ EPISODE_ID        <int> 143838, 143845, 143845, 143983, 143983, 144018, 1440…
## $ STATE             <chr> "WISCONSIN", "WISCONSIN", "WISCONSIN", "WISCONSIN", …
## $ YEAR              <int> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020…
## $ MONTH_NAME        <chr> "January", "January", "January", "January", "January…
## $ EVENT_TYPE        <chr> "Flood", "Flood", "Flood", "Lakeshore Flood", "Lakes…
## $ INJURIES_DIRECT   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ INJURIES_INDIRECT <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DEATHS_DIRECT     <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DEATHS_INDIRECT   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ DAMAGE_PROPERTY   <chr> "96.00K", "0.00K", "0.00K", "10.70M", "0.00K", "50.0…
## $ DAMAGE_CROPS      <chr> "0.00K", "0.00K", "0.00K", "0.00K", "0.00K", "0.00K"…
## $ FLOOD_CAUSE       <chr> "Ice Jam", "Heavy Rain / Snow Melt", "Heavy Rain / S…
## $ CATEGORY          <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ BEGIN_LOCATION    <chr> "OCONTO", "KEWAUNEE", "OCONTO", "", "", "MATNEY", "S…
## $ END_LOCATION      <chr> "OCONTO", "KEWAUNEE", "OCONTO", "", "", "KELLERSVILL…
## $ LATITUDE          <dbl> 44.8939, 44.4607, 44.9116, NA, NA, 36.2495, 36.5023,…
## $ LONGITUDE         <dbl> -87.8492, -87.5019, -87.8422, NA, NA, -81.8980, -81.…
## $ FATALITY_AGE      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ FATALITY_SEX      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ FATALITY_LOCATION <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ BEGIN_DATE_TIME   <chr> "01-JAN-20 00:00:00", "01-JAN-20 00:00:00", "01-JAN-…
## $ MAGNITUDE_TYPE    <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", …
## $ MAGNITUDE         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ BEGIN_DATE        <date> 2020-01-01, 2020-01-01, 2020-01-01, 2020-01-11, 202…

EDA on States

I would like to know which state have the most counts of flood, top 5 states that have the highest frequency of being attacked by flood, and the property damage + crop damage for each state.

Flood counts for states

# Check all states
state_all <- dangerousflood |> group_by(STATE) |> count()
# if(sum(state_all$n) == dim(dangerousflood)[1])
  {print("Every row has value in the State column.")}
## [1] "Every row has value in the State column."
print(state_all)
## # A tibble: 55 × 2
## # Groups:   STATE [55]
##    STATE                    n
##    <chr>                <int>
##  1 ALABAMA                336
##  2 ALASKA                  25
##  3 AMERICAN SAMOA          11
##  4 ARIZONA                454
##  5 ARKANSAS               460
##  6 CALIFORNIA             367
##  7 COLORADO               147
##  8 CONNECTICUT             86
##  9 DELAWARE                38
## 10 DISTRICT OF COLUMBIA    29
## # ℹ 45 more rows
# The most counts of states
state_max <- state_all$STATE[which(state_all$n ==  max(state_all$n)  )]
print(state_max)
## [1] "VIRGINIA"
# Group by STATE and count the number of occurrences
state_all <- dangerousflood %>% 
  group_by(STATE) %>% 
  summarise(Count= n())
# Top counts of states
top_states <- state_all %>% 
  top_n(5, Count)
print(top_states)
## # A tibble: 5 × 2
##   STATE    Count
##   <chr>    <int>
## 1 KENTUCKY   696
## 2 MISSOURI   797
## 3 NEW YORK   611
## 4 TEXAS      751
## 5 VIRGINIA  1040
states_map <- map_data("state")

It appears that flood occurrences are not evenly distributed across the United States, with some states experiencing a higher frequency of events. Notably, Virginia stands out with a total of 1,040 reported flood incidents for the years 2020 and 2021. On average, Virginia experienced nearly three flood events every day and this high frequency highlights the significant and persistent challenge for government and people.

Missouri had 797 events in total. Kentucky, Texas, and New York also recorded 696, 751, and 611 occurrences respectively.

State vs Property and Crop Damages

Now it’s time to visualize the financial/agricultural damages of flood.

Property Damage
# Function to convert damage values to numeric
convert_damage_to_numeric <- function(damage) {
  if (is.na(damage) || damage == "") {
    return(NA)
  }
  # Remove commas, convert to upper case, and trim leading/trailing whitespace
  damage <- gsub(",", "", damage)
  damage <- toupper(trimws(as.character(damage)))

  # Convert based on suffixes
  multiplier <- 1
  if (grepl("K$", damage)) {
    multiplier <- 1e3
    damage <- gsub("K$", "", damage)
  } else if (grepl("M$", damage)) {
    multiplier <- 1e6
    damage <- gsub("M$", "", damage)
  } else if (grepl("B$", damage)) {
    multiplier <- 1e9
    damage <- gsub("B$", "", damage)
  }
  
  # Convert to numeric and multiply by the appropriate factor
  numeric_damage <- as.numeric(damage)
  if (!is.na(numeric_damage)) {
    return(numeric_damage * multiplier)
  } else {
    return(NA)
  }
}

# Apply the function to the DAMAGE_PROPERTY and DAMAGE_CROPS columns
dangerousflood$DAMAGE_PROPERTY_NUM <- sapply(dangerousflood$DAMAGE_PROPERTY, convert_damage_to_numeric)
dangerousflood$DAMAGE_CROPS_NUM <- sapply(dangerousflood$DAMAGE_CROPS, convert_damage_to_numeric)

# Now, summarize the total damage by state using the new numeric columns
state_damage <- dangerousflood %>%
  group_by(STATE) %>%
  summarise(Total_Damage_Property = sum(DAMAGE_PROPERTY_NUM, na.rm = TRUE),
            Total_Damage_Crops = sum(DAMAGE_CROPS_NUM, na.rm = TRUE))

# Check the summary data frame
print(state_damage)
## # A tibble: 55 × 3
##    STATE                Total_Damage_Property Total_Damage_Crops
##    <chr>                                <dbl>              <dbl>
##  1 ALABAMA                            4985000                  0
##  2 ALASKA                            15429740                  0
##  3 AMERICAN SAMOA                      701000              81000
##  4 ARIZONA                           87238500                  0
##  5 ARKANSAS                           1548000                  0
##  6 CALIFORNIA                        22385000              18000
##  7 COLORADO                           2106710             425240
##  8 CONNECTICUT                        7110510                  0
##  9 DELAWARE                           3600000                  0
## 10 DISTRICT OF COLUMBIA                100000                  0
## # ℹ 45 more rows
# Plot total property damage by state
ggplot(state_damage, aes(x = reorder(STATE, -Total_Damage_Property), y = Total_Damage_Property)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Total Property Damage by State",
       x = "State",
       y = "Total Property Damage (USD)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The data indicates that New York incurs the highest total property damage from floods (over 15000k), when measured in thousands of U.S. dollars, followed by Louisana, Michigan, and Pennsylvania. This finding is consistent with the high frequency of flooding events reported in New York. Those state are mostly in eastern side of the state, suggest a higher flood impact on the east.

Virginia has a low total property damage than the above states, and interestingly, Missouri, despite having a high number of flood incidents, does not correspond with similarly high levels of property damage. That might suggest an effective flood management by the state government or some location advantages that mitigate the cost.

Total Property Damage in a map visual
# Function to convert character damage values to numeric
convert_damage <- function(damage) {
  if (is.na(damage)) {
    return(NA_real_)
  }
  multiplier <- ifelse(grepl("K", damage), 1e3, ifelse(grepl("M", damage), 1e6, 1))
  as.numeric(gsub("[^0-9.]", "", damage)) * multiplier
}

# Apply the function to the DAMAGE_PROPERTY column
dangerousflood <- dangerousflood %>%
  mutate(DAMAGE_PROPERTY = sapply(DAMAGE_PROPERTY, convert_damage))

# Now let's try the summarise again
state_damage_totals <- dangerousflood %>%
  group_by(STATE) %>%
  summarise(Total_Damage_Property = sum(DAMAGE_PROPERTY, na.rm = TRUE)) %>%
  mutate(STATE = tolower(STATE)) # Ensure state names match map data

# Merge map data with total property damages
map_data_merged <- merge(states_map, state_damage_totals, by.x = "region", by.y = "STATE", all.x = TRUE)

# Plot with more visible colors
ggplot() +
  geom_polygon(data = map_data_merged, aes(x = long, y = lat, group = group, fill = Total_Damage_Property), color = "white") +
  scale_fill_continuous(name = "Total Property Damage (USD)", 
                        low = "lightblue1", high = "darkblue", 
                        na.value = "grey50", # Grey color for states with NA values
                        labels = scales::comma) + # Format labels with commas for readability
  labs(title = "Total Property Damage by State") +
  theme_void() +
  theme(legend.position = "right")

The distribution of flood-related property damage across the United States reveals a distinct pattern, with the eastern regions, sustaining the most significant financial impact.

The Middle to west and Northern states report relatively minimal damage, which might reflect less frequent flood events. Conversely, the Southern states experience severe property damage due to flooding. This severity could be due to a combination of factors, including the prevalence of hurricanes and tropical storms that bring intense rainfall. Also, coastal areas have a relatively high damage on property.

Crops Damage
# Log transform of total crop damage
ggplot(state_damage, aes(x = reorder(STATE, -Total_Damage_Crops), y = log(Total_Damage_Crops + 1))) +
  geom_bar(stat = "identity", fill = "cyan4") +
  labs(title = "Log of Total Crop Damage by State",
       x = "State",
       y = "Log of Total Crop Damage") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The distribution of flood-related crop damage across various states exhibits a pronounced right skew, thus I take a log transformation to make the plot easier to interpret.

However, it’s important to note that the absence of crop damage data for many states and I could not get those data. Apparently, the NOAA datasets have some ‘NA’ values where I should pay attention to. Despite this limitation, an examination of the available data reveals that states such as the Dakotas, Minnesota, and Iowa have sustained the most severe crop damage. This pattern may reflect these states’ considerable agricultural activities and their losses on agricultural lands due to flood.

Total Property Damage in a map visual
# Ensure the states_map data is loaded
states_map <- map_data("state")

# Apply the function to the DAMAGE_CROPS column
dangerousflood <- dangerousflood %>%
  mutate(DAMAGE_CROPS = sapply(DAMAGE_CROPS, convert_damage))

# Now let's summarize the crop damages by state
state_crop_damage_totals <- dangerousflood %>%
  group_by(STATE) %>%
  summarise(Total_Damage_Crops = sum(DAMAGE_CROPS, na.rm = TRUE)) %>%
  mutate(STATE = tolower(STATE)) # Ensure state names match map data

# Merge the states_map with the summarized crop damages
map_data_merged_crops <- merge(states_map, state_crop_damage_totals, by.x = "region", by.y = "STATE", all.x = TRUE)

# Now create the plot with the merged data
ggplot() +
  geom_polygon(data = map_data_merged_crops, aes(x = long, y = lat, group = group, fill = Total_Damage_Crops), color = "white") +
  scale_fill_continuous(name = "Total Crop Damage", 
                        trans = "log1p", # Log transformation to better visualize the range
                        low = "lightgreen", high = "darkgreen", 
                        na.value = "grey50", # Grey color for states with NA values
                        breaks = scales::trans_breaks("log1p", function(x) 10^x), # Adjust breaks for log scale
                        labels = scales::trans_format("log1p", scales::math_format(10^.x))) + # Adjust labels for log scale
  labs(title = "Total Crop Damage by State") +
  theme_void() +
  theme(legend.position = "right",
        legend.key.width = unit(1.5, "cm"), # Adjust legend key width
        legend.title = element_text(size = 10), # Adjust legend title size
        legend.text = element_text(size = 8)) # Adjust legend text size

Visual analysis of the crop damage data illuminates a notable concentration of severe impacts in the northern part of the United States. States like the Dakotas, Minnesota, and Iowa, which are characterized by extensive agricultural activity, appear to bear the brunt of the damage. These regions, often referred to as the nation’s breadbasket, may be particularly susceptible due to their reliance on large-scale farming and the potential for flooding to disrupt vast swathes of cropland.

In the South, Texas stands out with significant crop damage, possibly attributable to its size, diverse climate zones, and exposure to extreme weather events such as hurricanes, which can bring about widespread flooding. The damage in these southern areas, though less concentrated than in the North, still represents a substantial economic impact, highlighting the vulnerability of agricultural sectors to flooding across varied geographical landscapes.

Injuries & Deaths

Investigating the relationship between flood-related injuries and fatalities and their distribution among the states could provide insights into the human cost of these natural disasters and understand how dangerous flood is.

# Summarize death counts by state
death_counts_by_state <- dangerousflood %>%
  group_by(STATE) %>%
  summarise(Total_Deaths = sum(DEATHS_DIRECT + DEATHS_INDIRECT))

# Summarize injury counts by state
injury_counts_by_state <- dangerousflood %>%
  group_by(STATE) %>%
  summarise(Total_Injuries = sum(INJURIES_DIRECT + INJURIES_INDIRECT))

# Base plot with death counts in green
p_combined <- plot_ly(death_counts_by_state, x = ~reorder(STATE, -Total_Deaths), y = ~Total_Deaths, type = 'bar', marker = list(color = 'green'), name = 'Deaths') %>%
  layout(title = "Total Counts Due to Floods by State",
         xaxis = list(title = "State", tickangle = 45),
         yaxis = list(title = "Total Counts"))

# Add injury counts to the same plot in blue
p_combined <- p_combined %>%
  add_trace(data = injury_counts_by_state, x = ~reorder(STATE, -Total_Injuries), y = ~Total_Injuries, type = 'bar', marker = list(color = 'blue'), name = 'Injuries')

# Render the combined plot
p_combined

The data indicates that Tennessee and New York experience the highest numbers of deaths and injuries due to flooding, suggesting these states are particularly vulnerable to the more perilous consequences of such events. Texas and Arizona also report significant numbers, where probably due to the geographic location.

# Convert BEGIN_DATE_TIME to Date format
dangerousflood$BEGIN_DATE <- as.Date(dangerousflood$BEGIN_DATE_TIME, format = "%d-%b-%y %H:%M:%S")

# Aggregate injuries and deaths by date
injuries_deaths_by_date <- dangerousflood %>%
  group_by(BEGIN_DATE) %>%
  summarise(Total_Injuries = sum(INJURIES_DIRECT + INJURIES_INDIRECT),
            Total_Deaths = sum(DEATHS_DIRECT + DEATHS_INDIRECT))

# Time Series Line Plot of Total Injuries Over Time
p_injuries <- plot_ly(data = injuries_deaths_by_date, x = ~BEGIN_DATE, y = ~Total_Injuries, type = 'scatter', mode = 'lines', line = list(color = 'cyan4'), name = 'Injuries') %>%
  layout(title = "Time Series of Total Injuries & Deaths Over Time",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Total Injuries"))

# Time Series Line Plot of Total Deaths Over Time
p_deaths <- plot_ly(data = injuries_deaths_by_date, x = ~BEGIN_DATE, y = ~Total_Deaths, type = 'scatter', mode = 'lines', line = list(color = 'red'), name = 'Deaths') %>%
  layout(title = "Time Series of Total Injuries & Deaths Over Time",
         xaxis = list(title = "Date"),
         yaxis = list(title = "Total Deaths"))

# Arrange the plots in a grid
subplot(p_injuries, p_deaths, nrows = 2, shareX = TRUE, titleX = FALSE)

What I did here is to do a time series of how injuries and deaths occured in this particular timeframe. It turned out that between July 2021 and October 2021, a period that coincides with some of the most devastating floods. This trend underscores the heightened risk of serious harm during such events, as evidenced by the consistent occurrence of deaths throughout this timeframe. Thus, we can’t conclude flood as a seasonal event.This data highlights the critical importance of effective flood management and emergency response measures all over the year, particularly during periods of high risk.

EDA ON EVENT_TYPE / CAUSE

The dataset contains four flood types and I am intrigued by the potential connections between flood type and causes. Thus, I built a contingency table that show all the detailed information. Such a table would organize the data to show the frequency of each cause within each flood type, allowing you to identify patterns or trends in the data.

# Contingency Table without NA values
# Contingency Table
table(dangerousflood$FLOOD_CAUSE, dangerousflood$EVENT_TYPE)
##                               
##                                Coastal Flood Flash Flood Flood Lakeshore Flood
##                                          404           0     0             156
##   Dam / Levee Break                        0          12     5               0
##   Heavy Rain                               0        7918  4455               0
##   Heavy Rain / Burn Area                   0         136     3               0
##   Heavy Rain / Snow Melt                   0          33   215               0
##   Heavy Rain / Tropical System             0         266    27               0
##   Ice Jam                                  0           1    26               0
##   Planned Dam Release                      0           0     4               0

The contingency table reveals a distinct distribution of causes across different flood types, with ‘Heavy Rain’ being the predominant cause for both ‘Flash Flood’ and the more general category labeled as ‘Flood.’ However, for ‘Lakeshore Flood’ and ‘Coastal Flood,’ the dataset does not provide information on the causes, leaving a gap in our understanding of the factors that contribute to these specific types of flooding. This absence of data prevents us from drawing concrete conclusions about what triggers ‘Lakeshore Flood’ and ‘Coastal Flood’ events, highlighting an area where further data collection and research are necessary to complete the picture of flood causality.

Flood occurences in the country

Thus, we created a new subset called ‘coastal_lakeshore_floods’, which only contains ‘Lakeshore Flood’ and ‘Coastal Flood’.

coastal_lakeshore_floods <- dangerousflood %>%
  filter(grepl("Coastal Flood", EVENT_TYPE) | grepl("Lakeshore Flood", EVENT_TYPE))

I would like toexplore the relationship between flood occurrences and location.

# Filter for specific flood events using grepl for a pattern match
dangerousflood_filtered <- dangerousflood %>%
  filter(grepl("Flood|Lakeshore Flood|Flash Flood|Coastal Flood", EVENT_TYPE)) %>%
  filter(between(LATITUDE, 24, 49.3843), between(LONGITUDE, -125, -66.93457)) # Bounds for the continental US

# Get the map data for the US
us_map <- map_data("state")

# Plot the map with flood events
ggplot() +
  geom_polygon(data = us_map, aes(x = long, y = lat, group = group), fill = "white", color = "black") +
  geom_point(data = dangerousflood_filtered, aes(x = LONGITUDE, y = LATITUDE, color = EVENT_TYPE), size = 1, alpha = 0.8) +
  scale_color_viridis_d(guide = guide_legend(override.aes = list(size=4))) + # Use a larger size for legend items
  labs(title = "Flood Events in the United States by Type", x = "Longitude", y = "Latitude") +
  theme_minimal()

The absence of precise longitude and latitude data for coastal and lakeshore floods in the dataset does present a limitation for detailed geospatial analysis. Nevertheless, it is reasonable to infer that such events predominantly occur along coastlines and lake shores. Despite this data gap, the overall trend suggests a higher incidence of floods in the eastern regions of the United States. This pattern may be influenced by a variety of factors, including regional climate conditions, topography, and the presence of water bodies such as rivers and estuaries that are prone to overflowing.

In contrast, the Western states, aside from coastal areas like California, Oregon, and Washington, report fewer flood events. This could reflect the region’s different climate and landscape characteristics, which include arid areas and mountainous terrains that are less susceptible to flooding compared to the flat and water-rich Eastern landscapes. However, coastal Western states, with their proximity to the Pacific Ocean, do experience their share of flood events, potentially related to Pacific storms and seasonal weather patterns.

# Summarize the data to get the count of flood events per type per state and average coordinates
dangerousflood_summary <- dangerousflood_filtered %>%
  group_by(STATE, EVENT_TYPE) %>%
  summarize(Count = n(),
            AVG_LATITUDE = mean(LATITUDE, na.rm = TRUE),
            AVG_LONGITUDE = mean(LONGITUDE, na.rm = TRUE)) %>%
  ungroup()
## `summarise()` has grouped output by 'STATE'. You can override using the
## `.groups` argument.
# Get the map data for the US
us_map <- map_data("state")

# Define custom colors for the flood types
custom_colors <- c("Flood" = "blue", "Flash Flood" = "orange")

# Plot the map with flood events, varying the size of the dot by the count of events and using custom colors
ggplot() +
  geom_polygon(data = us_map, aes(x = long, y = lat, group = group), fill = "white", color = "black") +
  geom_point(data = dangerousflood_summary, aes(x = AVG_LONGITUDE, y = AVG_LATITUDE, size = Count, color = EVENT_TYPE), alpha = 0.6) +
  scale_size(range = c(1, 6)) + # Adjust the range for the size of the points as needed, making them smaller
  scale_color_manual(values = custom_colors) + # Use custom colors
  labs(title = "Flood Events in the United States by Type", x = "Longitude", y = "Latitude") +
  theme_minimal() +
  theme(legend.position = "bottom") # Move legend to bottom for better visibility

# Remove the summary warning messages
dplyr::last_dplyr_warnings()
## list()

Now I gave a state-based visualization of flood events, categorized by type and delineated by state, reveals a pronounced concentration of larger circles—denoting a higher number of floods—in the eastern regions. This correlates with the elevated levels of property damage recorded in these areas. On the other hand, the West Coast, particularly Oregon, and Washington, displays a distinct pattern: these states experience a moderate prevalence of flood and flash flood events, yet with a comparatively lower impact on property damage (except for California).

Flood Type vs Financial Loss

Property Damage

Now financial losses can be high related with different types of flood that I would like to pursue investigating.

# Function to convert damage property values to numeric
convert_damage <- function(damage) {
  if (is.na(damage) || damage == "") {
    return(NA)
  }
  
  # Remove commas and convert to upper case
  damage <- toupper(gsub(",", "", damage))
  
  # Convert based on suffixes
  if (grepl("K", damage)) {
    return(as.numeric(gsub("K", "", damage)) * 1e3)
  } else if (grepl("M", damage)) {
    return(as.numeric(gsub("M", "", damage)) * 1e6)
  } else if (grepl("B", damage)) {
    return(as.numeric(gsub("B", "", damage)) * 1e9)
  } else {
    return(as.numeric(damage))
  }
}

# Apply the function to the DAMAGE_PROPERTY column
dangerousflood$DAMAGE_PROPERTY_NUM <- sapply(dangerousflood$DAMAGE_PROPERTY, convert_damage, USE.NAMES = FALSE)

Thus, I create a graph visualization below:

# Summarize the damage by event type, omitting NAs
damage_summary <- dangerousflood %>%
  filter(grepl("Flood|Lakeshore Flood|Flash Flood|Coastal Flood", EVENT_TYPE, ignore.case = TRUE)) %>%
  filter(!is.na(DAMAGE_PROPERTY_NUM)) %>%
  group_by(EVENT_TYPE) %>%
  summarize(Total_Damage = sum(DAMAGE_PROPERTY_NUM, na.rm = TRUE),
            Average_Damage = mean(DAMAGE_PROPERTY_NUM, na.rm = TRUE),
            .groups = 'drop') # Drop grouping structure after summarizing

# Create a bar plot to compare total damage by event type
ggplot(damage_summary, aes(x = reorder(EVENT_TYPE, Total_Damage), y = Total_Damage, fill = EVENT_TYPE)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Total Property Damage by Flood Event Type", x = "Event Type", y = "Total Damage (USD)") +
  scale_fill_viridis_d() # Use a viridis color scale for the fill

Flash floods stand out as the most financially destructive, incurring property damage expenditures that far exceed those of general floods—often referred to simply as ‘floods’ in the dataset—by approximately threefold. The scarcity of occurrences of lakeshore and coastal floods in the dataset precludes a comprehensive assessment of their average financial toll. However, the available figures suggest that while these flood types may be less frequent, when they do occur, the property damage might be significant. Thus, I ploted another grpah to indicate average losses.

# Calculate the average damage by event type, omitting NAs
average_damage_summary <- dangerousflood %>%
  filter(grepl("Flood|Lakeshore Flood|Flash Flood|Coastal Flood", EVENT_TYPE, ignore.case = TRUE)) %>%
  filter(!is.na(DAMAGE_PROPERTY_NUM)) %>%
  group_by(EVENT_TYPE) %>%
  summarize(Average_Damage = mean(DAMAGE_PROPERTY_NUM, na.rm = TRUE),
            .groups = 'drop') # Drop grouping structure after summarizing

# Print the average damage summary
print(average_damage_summary)
## # A tibble: 4 × 2
##   EVENT_TYPE      Average_Damage
##   <chr>                    <dbl>
## 1 Coastal Flood           28090.
## 2 Flash Flood            346041.
## 3 Flood                  125967.
## 4 Lakeshore Flood        341298.
# Create a bar plot to compare average damage by event type
ggplot(average_damage_summary, aes(x = reorder(EVENT_TYPE, -Average_Damage), y = Average_Damage, fill = EVENT_TYPE)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Average Property Damage by Flood Event Type", x = "Event Type", y = "Average Damage (USD)") +
  scale_fill_viridis_d() # Use a viridis color scale for the fill

The data and plot suggess that lakeshore flood, while less frequent, tend to be more devastating on average when they do occur. This could be attributed to the specific characteristics of lakeshore floods, which may impact areas with high property values or cause more extensive damage due to their nature. In contrast, general floods, often simply referred to as ‘floods,’ appear to be the least destructive on a per-event basis, which might earn them the moniker of being relatively ‘gentle’ floods. Nonetheless, it’s important to note that the average property damage per flood event across all types is substantial, drawing attention to the significant financial burden these natural disasters can impose.

Crops Damage
# Function to convert damage crops values to numeric
convert_damage_crops <- function(damage) {
  if (is.na(damage) || damage == "") {
    return(NA)
  }
  
  # Remove commas and convert to upper case
  damage <- toupper(gsub(",", "", damage))
  
  # Convert based on suffixes
  if (grepl("K", damage)) {
    return(as.numeric(gsub("K", "", damage)) * 1e3)
  } else if (grepl("M", damage)) {
    return(as.numeric(gsub("M", "", damage)) * 1e6)
  } else if (grepl("B", damage)) {
    return(as.numeric(gsub("B", "", damage)) * 1e9)
  } else {
    return(as.numeric(damage))
  }
}

# Apply the function to the DAMAGE_CROPS column
dangerousflood$DAMAGE_CROPS_NUM <- sapply(dangerousflood$DAMAGE_CROPS, convert_damage_crops, USE.NAMES = FALSE)
# Calculate the average crop damage by event type, omitting NAs
average_crop_damage_summary <- dangerousflood %>%
  filter(grepl("Flood|Lakeshore Flood|Flash Flood|Coastal Flood", EVENT_TYPE, ignore.case = TRUE)) %>%
  filter(!is.na(DAMAGE_CROPS_NUM)) %>%
  group_by(EVENT_TYPE) %>%
  summarize(Average_Crop_Damage = mean(DAMAGE_CROPS_NUM, na.rm = TRUE),
            .groups = 'drop') # Drop grouping structure after summarizing

# Print the average crop damage summary
print(average_crop_damage_summary)
## # A tibble: 4 × 2
##   EVENT_TYPE      Average_Crop_Damage
##   <chr>                         <dbl>
## 1 Coastal Flood                 5602.
## 2 Flash Flood                  15153.
## 3 Flood                         1235.
## 4 Lakeshore Flood                  0
# Create a bar plot to compare average crop damage by event type
ggplot(average_crop_damage_summary, aes(x = reorder(EVENT_TYPE, -Average_Crop_Damage), y = Average_Crop_Damage, fill = EVENT_TYPE)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Average Crop Damage by Flood Event Type", x = "Event Type", y = "Average Damage (USD)") +
  scale_fill_viridis_d() # Use a viridis color scale for the fill

Flash floods emerge as the predominant culprits for crop losses, underscoring their intense and often localized impact on agricultural areas. General floods also contribute to a considerable extent to the average crop damage, although to a lesser degree than flash floods. In stark contrast, coastal and lakeshore floods are found to have a negligible effect on crop damage. This lack of impact could be due to the geographical locations of such floods, which may not typically overlap with agricultural zones. These findings highlight the variable threats that different types of floods pose to agricultural livelihoods and can inform tailored approaches to agricultural risk management and resilience building.

FLood Type vs Injuries/Deaths

dangerousflood <- dangerousflood %>%
  mutate(Total_Injuries = INJURIES_DIRECT + INJURIES_INDIRECT,
         Total_Deaths = DEATHS_DIRECT + DEATHS_INDIRECT)

# Summarize total injuries and deaths by flood type
injuries_deaths_by_type <- dangerousflood %>%
  group_by(EVENT_TYPE) %>%
  summarise(Total_Injuries = sum(Total_Injuries, na.rm = TRUE),
            Total_Deaths = sum(Total_Deaths, na.rm = TRUE)) %>%
  ungroup()

# Reshape the data into long format
injuries_deaths_long <- injuries_deaths_by_type %>%
  pivot_longer(cols = c(Total_Injuries, Total_Deaths), names_to = "Metric", values_to = "Count")

# Create the plot
ggplot(injuries_deaths_long, aes(x = EVENT_TYPE, y = Count, fill = Metric)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  scale_fill_manual(values = c("cyan4", "maroon")) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "top") + # Move legend to the top for better readability
  labs(title = 'Injuries and Deaths by Flood Type',
       x = 'Flood Type',
       y = 'Count',
       fill = 'Metric')

Flash floods are a deadly phenomenon that can claim lives with their sudden onset and rapid escalation, as evidenced by the recorded fatalities surpassing 200, while injuries stand at approximately 75. The data suggests that the lethality of flash floods is significantly higher compared to other flood types, with deaths outnumbering injuries. This trend is also observed with general floods, albeit with a lower overall count of affected individuals. The absence of fatalities associated with lakeshore and coastal floods in the records could indicate a lower human impact, or it may reflect data limitations. As flash floods are particularly lethal, with a higher number of fatalities relative to injuries, government should be careful for such type of flood as it might have very limited reaction time for people to evacuate.

EDA on Timeline

# Aggregate the data to count the number of flood events per state for each year
yearly_state_flood_counts <- dangerousflood %>%
  count(STATE, YEAR) %>%
  mutate(STATE = tolower(STATE)) # Ensure state names match map data

# Get US map data
states_map <- map_data("state")

# Merge map data with yearly flood counts
map_data_merged <- merge(states_map, yearly_state_flood_counts, by.x = "region", by.y = "STATE", all.x = TRUE)

# Plot the map with facets for each year
ggplot(map_data_merged, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = n), color = "white") +
  scale_fill_gradient(low = "lightblue", high = "darkred", na.value = "grey50", name = "Flood Counts") +
  facet_wrap(~YEAR) + # Facets for each year
  labs(title = "Yearly Flood Event Counts by State") +
  theme_void() +
  theme(strip.text.x = element_text(size = 8), # Make facet labels smaller
        legend.position = "bottom")

The plot of flood occurrences from 2020 to 2021 reveals a marked escalation in certain states, most notably Texas and New York. This upward trend in flood events signals a critical need for heightened awareness and proactive measures from local and federal government agencies.

This statement underscores the importance of the data findings and suggests the need for action based on the trend observed.

Conclusion

In reviewing the comprehensive NOAA dataset, it becomes evident that floods pose a significant risk, often resulting in fatalities rather than just injuries. The distribution of flood occurrences varies notably across different regions of the United States, with some states experiencing an increase in flood events in 2021 compared to the previous year. These incidents have led to a substantial number of deaths and injuries, underscoring the importance of effective disaster preparedness and response strategies. In particular, flash floods emerge as exceptionally perilous, causing considerable harm to individuals and communities. Furthermore, lakeshore floods, while less frequent, have the potential to inflict substantial property damage. Coastal areas, due to their relatively high incidence of flooding and associated economic losses, warrant special attention for mitigation and safety measures. Collectively, these insights highlight the critical need for heightened flood awareness and preparedness to safeguard lives and property against this natural disaster.